library(xcms)
library(tidyverse)
library(plotly)

Introduction

A fast recap about what we know about metabolites in LC-MS …

So every metabolite …

So …

Peak Picking

The process of finding chromatographic peaks in the data is called peak picking. It can be done in many different ways, and actually every software will do it slightly differently. The first step in the analysis of our dataset will be to pick the peaks in the full set of samples. Here I’ll show you the process on one file.

There are some important point to remember.

Working on a representative file will help you in fine tuning and benchmarking your parameters

Let’s start reeding in a raw file

raw_one <- readMSData(
  files = "../data/wines/x016_X_QC_X_4_NEG_DDA.mzML",
  msLevel. = 1, ## we read only MS1
  mode = "onDisk")  ## with this parameter the data are not loaded into RAM

I’ll now show to you how to perform peak picking with two algoritms availabe in xcms. In the 99% of the case you will use only one of them (CentWave), but it is nice - once in the life - to really put your hands in the machine

Peak picking: matched filter

The “older” and most sounding way of finding peaks implemented in xcms is the matched filter algorithm.

A full description of the parameters of the algorithm can be found in the xcms manual, here we focus on:

In xcms the parameters of the algorithm are stored into a specific object:

mf <- MatchedFilterParam(binSize = 0.1, 
                         fwhm = 6, 
                         snthresh = 5) 
mf
## Object of class:  MatchedFilterParam 
##  Parameters:
##  - binSize: [1] 0.1
##  - impute: [1] "none"
##  - baseValue: numeric(0)
##  - distance: numeric(0)
##  - fwhm: [1] 6
##  - sigma: [1] 2.547987
##  - max: [1] 5
##  - snthresh: [1] 5
##  - steps: [1] 2
##  - mzdiff: [1] 0.6
##  - index: [1] FALSE

Now I can use the previous parameters to find the peaks in one sample:

## this function is used to find the chromatographic peaks with the previous parameters
raw_one_mf_picked <- findChromPeaks(raw_one, param = mf)

raw_one_mf_picked 
## MSn experiment data ("XCMSnExp")
## Object size in memory: 0.24 Mb
## - - - Spectra data - - -
##  MS level(s): 1 
##  Number of spectra: 546 
##  MSn retention times: 0:01 - 11:60 minutes
## - - - Processing information - - -
## Data loaded [Wed Nov 16 11:00:42 2022] 
## Filter: select MS level(s) 1. [Wed Nov 16 11:00:42 2022] 
##  MSnbase version: 2.22.0 
## - - - Meta data  - - -
## phenoData
##   rowNames: x016_X_QC_X_4_NEG_DDA.mzML
##   varLabels: sampleNames
##   varMetadata: labelDescription
## Loaded from:
##   x016_X_QC_X_4_NEG_DDA.mzML 
## protocolData: none
## featureData
##   featureNames: F1.S0001 F1.S0004 ... F1.S1636 (546 total)
##   fvarLabels: fileIdx spIdx ... spectrum (35 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## - - - xcms preprocessing - - -
## Chromatographic peak detection:
##  method: matchedFilter 
##  638 peaks identified in 1 samples.
##  On average 638 chromatographic peaks per sample.

Ok, the software did his job. As you can see it was able to fine 638 peaks in this sample. As you can see the raw_one_mf_picked still holds peaks and raw data. The peak table can be extracted with a specific method

mf_peaks <- chromPeaks(raw_one_mf_picked) 
dim(mf_peaks)
## [1] 638  13
head(mf_peaks, 5)
##             mz    mzmin    mzmax       rt    rtmin    rtmax      into      intf
## CP001 50.01399 50.01241 50.04396 337.7074 334.5620 340.1007  358728.6  387501.5
## CP002 51.67812 51.67799 51.67819 306.2680 303.3541 311.2829  529555.9  428820.5
## CP003 53.00739 53.00735 53.00742 327.9940 326.5501 330.6442 1460325.7 1670480.7
## CP004 53.23135 53.23127 53.23137 303.3541 301.9032 306.2680  419943.3  519133.2
## CP005 54.56881 54.56869 54.56895 276.0245 273.0395 277.6028  183758.4  203009.4
##            maxo      maxf i        sn sample
## CP001  99313.48 102323.41 1  6.905211      1
## CP002 115270.37  94863.14 1  7.721503      1
## CP003 478205.50 461500.98 1 16.710695      1
## CP004 166208.92 141195.95 1  6.904875      1
## CP005  37604.99  45538.38 1  5.020740      1

Let’s walk to the most relevant columns:

Let’s now give a look to the position of the peaks in the mz/rt plane. The size of the point will be proportional to the intensity

mf_peaks %>% 
  as.data.frame() %>% 
  mutate(into = sqrt(into)) %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz, alpha = into, col = into)) + 
  scale_color_viridis_c() + 
  theme_bw()

If you go back to the previous demo, we were focussing on a specific area of the raw signal which was particularly promising

rt <- rtime(raw_one)
mz <- mz(raw_one)
I <- intensity(raw_one)
sub_peaks_mf <- mf_peaks %>% 
  as.data.frame() %>% 
  filter(mz > 284 & mz < 300) %>% 
  filter(rt > 200 & rt < 300) 



ggplotly(tibble(rt = rt, mz = mz, I = I)  %>% 
  unnest(c("mz","I")) %>%
  filter(mz > 284 & mz < 300) %>% 
  filter(rt > 200 & rt < 300) %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz, col = log10(I), size = I)) + 
  geom_point(data = sub_peaks_mf, aes(x = rt, y = mz), col = "red", pch = 4, size = 3) + 
  scale_color_viridis_c() + 
  theme_light())

What we see:

This view gives an idea of the boundaries of the peaks

tibble(rt = rt, mz = mz, I = I)  %>% 
  unnest(c("mz","I")) %>%
  filter(mz > 284 & mz < 300) %>% 
  filter(rt > 200 & rt < 300) %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz, col = log10(I), size = I)) + 
  geom_point(data = sub_peaks_mf, aes(x = rt, y = mz), col = "red", pch = 4, size = 3) + 
  geom_segment(data = sub_peaks_mf, aes(x = rtmin, xend = rtmax, y = mz, yend = mz), col = "red") + 
  scale_color_viridis_c() + 
  theme_light()

So some lines are superimposed, some others not. Tricky business!

Peak Picking Cent Wave

Peak picking can also be performed with another algorithm: CentWave. The algorithm applied here is more clever and better suited for high resolution data. As we have seen in the lecture, it relies on the fact that the mass trace get stable in presence of a strong ionic signal

Also here many parameters (and others are not mentioned). I highlight here some of them:

cwp <- CentWaveParam(peakwidth = c(5, 30),    ## expected range of chromatographic peak width
                     ppm = 5,                 ## tolerance to identify ROIs in the mz/rt plane
                     prefilter = c(5, 50000),## number of consecutive scans showing a signal higher than 50000
                     noise = 5000)            ## minimum signal to be considered
cwp
## Object of class:  CentWaveParam 
##  Parameters:
##  - ppm: [1] 5
##  - peakwidth: [1]  5 30
##  - snthresh: [1] 10
##  - prefilter: [1]     5 50000
##  - mzCenterFun: [1] "wMean"
##  - integrate: [1] 1
##  - mzdiff: [1] -0.001
##  - fitgauss: [1] FALSE
##  - noise: [1] 5000
##  - verboseColumns: [1] FALSE
##  - roiList: list()
##  - firstBaselineCheck: [1] TRUE
##  - roiScales: numeric(0)
##  - extendLengthMSW: [1] FALSE

If we run the peak picking with this new algorithm…

raw_one_cw_picked <- findChromPeaks(raw_one, param = cwp)

raw_one_cw_picked 
## MSn experiment data ("XCMSnExp")
## Object size in memory: 0.24 Mb
## - - - Spectra data - - -
##  MS level(s): 1 
##  Number of spectra: 546 
##  MSn retention times: 0:01 - 11:60 minutes
## - - - Processing information - - -
## Data loaded [Wed Nov 16 11:00:42 2022] 
## Filter: select MS level(s) 1. [Wed Nov 16 11:00:42 2022] 
##  MSnbase version: 2.22.0 
## - - - Meta data  - - -
## phenoData
##   rowNames: x016_X_QC_X_4_NEG_DDA.mzML
##   varLabels: sampleNames
##   varMetadata: labelDescription
## Loaded from:
##   x016_X_QC_X_4_NEG_DDA.mzML 
## protocolData: none
## featureData
##   featureNames: F1.S0001 F1.S0004 ... F1.S1636 (546 total)
##   fvarLabels: fileIdx spIdx ... spectrum (35 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## - - - xcms preprocessing - - -
## Chromatographic peak detection:
##  method: centWave 
##  567 peaks identified in 1 samples.
##  On average 567 chromatographic peaks per sample.

Ok here we see that the number of spotted peaks is different. If you think to it it is not strange: different method different results.

The first natural question you can ask is: where are you getting the parameters? Well, as I already told you at some point the “expert knowledge” enters the process. Reasonable guess for them is always coming from a good knowledge of the analytical pipeline. Optimal value could be optimized in an automatic way (like IPO does), but reasonable guesses are fundamental to restrict the quest for the optimal solution in a multidimensional space. So either you understand the analytics, or you should go and speak with you “analytical” colleague.

The second question. Are we getting comparable results?

cw_peaks <- chromPeaks(raw_one_cw_picked) 
dim(cw_peaks)
## [1] 567  11
head(cw_peaks, 5)
##             mz    mzmin    mzmax      rt   rtmin   rtmax     into     intb
## CP001 218.9821 218.9820 218.9822  3.1473  0.6727  8.1741 547614.9 547606.2
## CP002 218.9822 218.9820 218.9824 35.8160 29.5353 43.3485 637327.6 637313.8
## CP003 218.9819 218.9817 218.9823 11.9442  8.1741 16.9582 596261.7 596251.7
## CP004 218.9821 218.9818 218.9823 20.7363 16.9582 24.4925 462471.5 462462.7
## CP005 174.9562 174.9559 174.9564  3.1473  0.6727 11.9442 690276.5 690264.0
##           maxo    sn sample
## CP001 79109.52 79109      1
## CP002 65368.86 65368      1
## CP003 71700.71 71700      1
## CP004 60961.76 60961      1
## CP005 67842.15 67841      1
ggplotly(mf_peaks %>% 
  as.data.frame() %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz), col = "red", pch = 3, alpha = 0.5) + 
  geom_point(cw_peaks %>% as_tibble(), mapping = aes(x = rt, y = mz), col = "blue", pch = 1, alpha = 0.5) +
  theme_bw())

Different, isn’t it? In some cases the two algorithms are coherent, in others the results are markedly different. Centwave it is also giving this strange horizontal line at large rt

Let’s give a look to our subregion

sub_peaks_cw <- cw_peaks %>% 
  as.data.frame() %>% 
  filter(mz > 284 & mz < 300) %>% 
  filter(rt > 200 & rt < 300) 



ggplotly(tibble(rt = rt, mz = mz, I = I)  %>% 
  unnest(c("mz","I")) %>%
  filter(mz > 284 & mz < 300) %>% 
  filter(rt > 200 & rt < 300) %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz, col = log10(I), size = I)) + 
  geom_point(data = sub_peaks_cw, aes(x = rt, y = mz), col = "red", pch = 4, size = 3) + 
  scale_color_viridis_c() + 
  theme_light())

Comments

  • For high resolution spectra as the one we have centWave is non to be more reliable results
  • In the first steps of your analysis you will tune-check-tune the parameters to find a reasonable
  • Your experience will drive this optimization: do you see molecules that should be there (standards, known compounds) in the peaklist?

Peak Picking all Dataset

In real life situations, when you are happy with your peak picking parameters you run the process on all your samples. xcms is designed for that, the only thing you have to do is to load more than on e file.

In view of the step in which the peak list will be merged together it is important also to add to the object containing the raw data any type of meta information which is associated to the samples. Typical meta information includes:

in xcms all these infos are stored in an AnnotatedDataFrame object

Typically all these info should be included in the filename, or in a csv which links the filename with the metadata. In the dataset we are using everything is in the file name, so let’s process them. In the following we will assume that the data are stored as mzML inside the data subfolder.

library(tools)  ## just to use file_path_sans_ext
phenodata <- tibble(filepath = list.files("../data/wines/", pattern = ".mzML", full.names = TRUE)) %>% 
  mutate(fname = file_path_sans_ext(basename(filepath))) %>% 
  separate(fname, into = c("inj_ord","color","variety","bottle","rep","polarity","mode"), 
           sep = "_", remove = FALSE) %>% 
  as(.,"AnnotatedDataFrame")
## Warning: Expected 7 pieces. Missing pieces filled with `NA` in 3 rows [15, 17,
## 19].
## to see the content 
pData(phenodata)
## # A tibble: 32 × 9
##    filepath               fname inj_ord color variety bottle rep   polar…¹ mode 
##    <chr>                  <chr> <chr>   <chr> <chr>   <chr>  <chr> <chr>   <chr>
##  1 ../data/wines//x001_X… x001… x001    X     blank   X      1     NEG     DDA  
##  2 ../data/wines//x002_X… x002… x002    X     QC      X      1     NEG     DDA  
##  3 ../data/wines//x003_X… x003… x003    X     QC      X      2     NEG     DDA  
##  4 ../data/wines//x004_r… x004… x004    red   sangio  B      1     NEG     DDA  
##  5 ../data/wines//x005_r… x005… x005    red   ptnoir  A      1     NEG     DDA  
##  6 ../data/wines//x006_r… x006… x006    red   merlot  A      1     NEG     DDA  
##  7 ../data/wines//x007_w… x007… x007    wht   vermen  A      1     NEG     DDA  
##  8 ../data/wines//x008_r… x008… x008    red   cannon  A      1     NEG     DDA  
##  9 ../data/wines//x009_w… x009… x009    wht   chardo  B      1     NEG     DDA  
## 10 ../data/wines//x010_w… x010… x010    wht   chardo  A      1     NEG     DDA  
## # … with 22 more rows, and abbreviated variable name ¹​polarity

Here we read the data in

raw_data <- raw_data <- readMSData(
  files = phenodata$filepath, 
  msLevel. = 1,
  pdata = phenodata, ## this is the structure of xcms holding phenotypic data
  mode = "onDisk")  ## with this parameter the data are not loaded into RAM

And perform the peak picking with centwave

register(SerialParam()) 
raw_data <- findChromPeaks(raw_data, param = cwp)
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 43 regions of interest ... OK: 20 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 718 regions of interest ... OK: 614 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 710 regions of interest ... OK: 634 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 734 regions of interest ... OK: 597 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 677 regions of interest ... OK: 568 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 810 regions of interest ... OK: 670 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 563 regions of interest ... OK: 524 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 756 regions of interest ... OK: 659 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 521 regions of interest ... OK: 497 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 733 regions of interest ... OK: 684 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 668 regions of interest ... OK: 488 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 742 regions of interest ... OK: 626 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 665 regions of interest ... OK: 593 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 44 regions of interest ... OK: 21 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 138 regions of interest ... OK: 104 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 712 regions of interest ... OK: 526 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 144 regions of interest ... OK: 103 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 707 regions of interest ... OK: 567 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 155 regions of interest ... OK: 127 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 538 regions of interest ... OK: 479 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 836 regions of interest ... OK: 609 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 509 regions of interest ... OK: 475 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 625 regions of interest ... OK: 666 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 741 regions of interest ... OK: 646 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 802 regions of interest ... OK: 598 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 715 regions of interest ... OK: 652 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 706 regions of interest ... OK: 533 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 707 regions of interest ... OK: 516 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 669 regions of interest ... OK: 639 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 39 regions of interest ... OK: 8 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 721 regions of interest ... OK: 677 found.
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 743 regions of interest ... OK: 612 found.

To spare time we save the picked object …

save(raw_data, file = "wines.RData")

Let’s give a look to the content of the full dataset

raw_data
## MSn experiment data ("XCMSnExp")
## Object size in memory: 4.84 Mb
## - - - Spectra data - - -
##  MS level(s): 1 
##  Number of spectra: 17243 
##  MSn retention times: 0:01 - 12:00 minutes
## - - - Processing information - - -
## Data loaded [Wed Nov 16 11:01:13 2022] 
## Filter: select MS level(s) 1. [Wed Nov 16 11:01:13 2022] 
##  MSnbase version: 2.22.0 
## - - - Meta data  - - -
## phenoData
##   rowNames: 1 2 ... 3 (32 total)
##   varLabels: filepath fname ... mode (9 total)
##   varMetadata: labelDescription
## Loaded from:
##   [1] x001_X_blank_X_1_NEG_DDA.mzML...  [32] x029_X_QC_X_6_NEG_DDA.mzML
##   Use 'fileNames(.)' to see all files.
## protocolData: none
## featureData
##   featureNames: F01.S0001 F01.S0004 ... F32.S1627 (17243 total)
##   fvarLabels: fileIdx spIdx ... spectrum (35 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## - - - xcms preprocessing - - -
## Chromatographic peak detection:
##  method: centWave 
##  15732 peaks identified in 32 samples.
##  On average 492 chromatographic peaks per sample.

Here we are dealing with 29 files with different characteristics. After the peak picking optimization step we are confident that the parameters we have been choosing should give reasonably good results. Before going on, it is however useful to make some additional checks on the data quality.

The first thing to do is to monitor the TIC of the 29 experiments to spot potential drops of the signal during the analysis

tics <- chromatogram(raw_data, aggregationFun = "sum", 
                     include = "none")     ## this argument is required to avoid 
                                           ## including all the chromatograms of the picked peaks

We now plot them, with colors matching the sample class

mypalette <- c("steelblue", "coral", "darkgreen")
names(mypalette) <- c("red","X","wht")

plot(tics, col = mypalette[raw_data$color])
legend("topright", legend = c("red","Blank & QC","white"), col = mypalette, lty = 1)

The plot shows already nice things:

The trend with the injection can be visualized as follows

tc <- split(tic(raw_data), f = fromFile(raw_data))
full_tics <- sapply(tc, sum)


pData(raw_data) %>% 
  tibble() %>% 
  add_column(full_tics = full_tics) %>% 
  ggplot() + 
  geom_point(aes(x = inj_ord, y = full_tics, col = color), size = 2) + 
  theme_bw() + 
  theme(aspect.ratio = 0.3)

So no clear trend is visible in the data. X here represents QC and blanks.

Another type ov visualization that I find useful with reasonably small datasets is the following

chromPeaks(raw_data) %>% 
  as_tibble() %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz, col = log10(maxo)), size = 1, alpha = 0.5) + 
  scale_color_viridis_c() + 
  facet_wrap(~factor(sample)) + 
  theme_bw() + 
  theme(aspect.ratio = 0.5)

Here we see the peak maps of the different files. It is clear that each map is expected to be slightly different, but rally outlying samples will show up clearly. Look to the blanks, for example, or to the horizontal lines which are present in many files

DIY

Something for you now: